Abstract

The past decade has experienced a phenomenal growth in the amount of data and resultant infor- 
mation generated by NASA’s operations and research projects. A key application is the reprocessing 
problem which has been identified to require data management capabilities beyond those available 
today [PRAT93]. The Intelligent Information Fusion (IIF) system [ROEL91] is an ongoing NASA 
project which has similar requirements. Deriving our understanding of NASA’s future data manage- 
ment needs based on the above, this paper describes an approach to using parallel computer systems 
(processor and I/O architectures) to develop an efficient parallel database management system to 
address these needs. Specifically, we propose to investigate issues in low-level record organization and 
management, complex query processing, and query compilation and scheduling. 


1. Problem Understanding: NASA’s Future Data Management Needs 

The past decade has experienced phenomenal growth in the amount of raw data and resultant
information generated by NASA's operations and research projects [ROEL91]. The need for significant
improvement in information technologies to manage, identify, and access this data has been clearly
identified [ROEL91, CROM92, CAMP90a, CAMP90b]. This section presents our view of NASA’s future data
management needs (at least in part). It is based on (i) the description of the reprocessing problem given in
[PRAT93], (ii) published descriptions of the Intelligent Information Fusion (IIF) system [ROEL91], and
(iii) miscellaneous NASA publications.

1.1 A View of NASA’s Data Management Architecture

Figure 1 shows the schematic of a system architecture where the principal emphasis is on the path 
data takes, and the transformations it goes through, from sensor collection to the scientific user. This 
architecture borrows from that of the IIF system [ROEL91]. The diagram is intended principally to aid
problem understanding and to establish a context for the subsequent discussion. It is by no
means a proposal of what the complete architecture for NASA’s data management system should be, and 
is much wider in scope than that of the present paper. 

Sensor data first goes through some very low-level processing to generate ’raw data’ [PRAT93] which 
is stored in a Parallel Raw Data Archive (PRDA). The reprocessing activity creates ’data products’ 
[PRAT93] which are managed by a Parallel Relational Database Management System (PRDBMS). Meta- 
data about both raw data and data products is stored in a Metadata Database (MDB). The three different 
types of data stores, i.e. the PRDA, PRDB, and MDB, reflect the three basically different types of usage 
of the data and metadata in such an environment [PRAT93]. The raw data is expected to be used mostly 
by reprocessing algorithms running on vector supercomputers and massively parallel processors (MPPs), 
and hence is shown managed by a high-performance file system. Since existing data products can also 
be inputs to the reprocessing activity [PRAT93], direct access to the Parallel Record Management Layer 
(PRML) of the PRDBMS by the machines running the reprocessing algorithms is shown. A typical 
user of the data products is a remote scientist who logs in and browses the metadata searching for data 
relevant to a research project. While most browsing involves interaction with the metadata, the scientist 
may periodically access data products as well as raw data to identify interesting data. Upon selecting
the needed data, the appropriate portion is downloaded to the scientist’s home location. To support
such a pattern of user behaviour the MDBMS should support large numbers of interactive browsing
sessions, each posing mostly small queries against the MDB, interspersed with occasional queries against the
PRDA or PRDB. While interactive response time is needed, the bandwidth required is expected to be
small during the browsing. Once browsing is complete, the user will issue a series of requests to extract
the data to be downloaded to the home location. These requests can be SQL queries to the PRDBMS
or file access requests to the PRDA. These requests are expected to have high bandwidth requirements
since a large volume of data may be extracted. Since execution times of different plans for a SQL query
can differ by a few orders of magnitude, query optimization is critical to ensure both interactive response
time and reduced system workload.

Figure 1. Our View of NASA Data Management Architecture

1.2 Parallel I/O: Key to NASA’s Data Management 

Given the volume of data/information in NASA’s applications, the use of multiple disks for storage is 
well accepted. In a database processing environment, researchers broadly agree that disk I/O is the main
bottleneck. Recent years have seen a phenomenal increase in processor speeds,
while the ’disk access time’ has not shown much improvement, exacerbating the ’access gap’ problem. The
advent of multiple processor machines has added to this problem. Fortunately, the computer architecture 
community has started addressing the needs of data intensive applications by developing parallel I/O 
architectures, e.g. Redundant Array of Inexpensive Disks (RAID) [PATT88] and Disk Arrays [GORD91]. 
This promises future parallel I/O systems which can feed data to the multiprocessor at a high sustained 
bandwidth. 

Along with the development of parallel I/O hardware, there is a need to develop efficient parallel I/O 
algorithms to exploit their full potential. The main focus of research in parallel algorithms has been on 
main memory resident data, where processor parallelism has been of primary concern [LEWI92]. With 
I/O bandwidth being a principal concern, high performance parallel databases require parallel algorithms 
for disk resident data. Parallel processing of database operations was first addressed by the database 
machine community, where the focus was on designing special-purpose hardware [SU86]. No single 
architecture was found suitable for all database applications, and the cost of building special purpose 
hardware for specific applications led to only limited success in this direction [DeWI92]. In the past few 
years there has been renewed interest in looking at database issues for general purpose parallel machines. 
The availability of a variety of commercial parallel machines, which has eliminated the expense of building 
special purpose hardware, is in large measure responsible for this [DeWI92]. 

A crucial factor in our choice of the relational model for the PRDBMS component of the architecture 
in Figure 1 is that the set-oriented, non-procedural nature of the relational model provides opportunities
for massive parallelization [DeWI92]. This choice is further supported by the fact that the IIF system
has already proposed using a relational DBMS for its low-level record management system (LLRMS) 
[ROEL91]. 

1.3 Scope of Our Project 

Realization of the architecture shown in Figure 1 is a major task and requires research and development 
in many areas. The scope of our project is limited to addressing problems in the PRDBMS component 
of the system. Specifically, we address the following problems: 

• Data organization, loading, sorting, and retrieval, and index creation and maintenance, in the 
Parallel Record Management Layer. The proposed solutions must consider that access requests 
to this layer will be a mix of (i) very high rate of large size access requests from the reprocessing 
algorithms, and (ii) low to medium to sometimes large size requests from the upper layers of 
PRDBMS. 

• Parallel algorithms to support expensive operations, e.g. join, union, etc., in the Parallel Relational
Algebra Layer.

• Compilation and optimization of SQL queries, and resource allocation and scheduling of operators
in the resultant plan. Minimizing response time and maximizing throughput will be considered as
the optimization criteria.

The rest of the paper is organized as follows: Section 2 presents the technical details of our approach. 
Section 3 presents a list of goals that must be met, including specific technical problems that must be 
solved, to make such a system a reality. Section 4 provides the conclusions and section 5 contains the list 
of references. 

2. Technical Details of the Proposed Approach 

Our overall goal is to investigate techniques for building a parallel database engine which could fulfill 
the needs of the PRDBMS component of Figure 1. Following are the key ideas behind our approach: 

• Tuples in a relation (or records in a file) are modeled as points in a multi-dimensional space, with 
each attribute representing an axis. 

• This multi-dimensional space can be divided into (overlapping or non-overlapping, nested or non- 
nested) subdivisions. 

• The subdivisions are allocated to different I/O units (e.g. disks) of a parallel computer, with usually
many subdivisions going to a single unit, and possibly a single subdivision replicated on multiple
units for reliability. This has been termed declustering [DeWI90]. The aim is to provide good (close
to optimal) load-balancing for query processing.

• New declustering-aware parallel algorithms for basic data retrieval operations, e.g. relation/file
scan, as well as complex operations, e.g. join and sort, are built to take advantage of the underlying
declustering.

• The query compiler/parallelizer/scheduler considers architectural parameters and declustering
information, in addition to the traditional query and database parameters, in minimizing the
execution plan cost. In addition, it generates an initial resource allocation schedule for plan
execution.

The remainder of this section is organized as follows: Section 2.1 presents an architecture for the 
PRDBMS. Sections 2.2 through 2.4 describe our approach to solving specific problems in the record 
management, relational algebra, and query compilation layers of the PRDBMS.

2.1 Parallel RDBMS Architecture 

As shown in Figure 1, the PRDBMS has a layered architecture. The parallel record management layer 
provides the abstraction of relations/tables which can be created, deleted, populated, sorted, and on 
which simple selections (predicates involving single relations only) can be performed. This abstraction is 
used both by the higher layers of the PRDBMS and by the reprocessing algorithms. The parallel relational 
algebra layer contains algorithms for complex operations such as join, union, difference, aggregation, etc. 
It uses the abstractions provided by the record management layer. The query compilation layer provides a 
declarative interface (SQL) to PRDBMS users (the intelligent front-end and metadata manager in Figure 
1), and does the necessary translation and optimization of declarative queries into a sequence of relational 
algebra operations. 

2.2 Parallel Record Management Layer 

The parallel record management layer uses the services offered by the operating system to provide an 
abstraction of relations/tables containing records. 

2.2.1 Requirements

We first identify the characteristics of data stored in the record management system as well as of the
retrieval requests on it. Datasets for many large-scale scientific applications, including those of NASA,
exhibit the following characteristics [ROEL91, CAMP90a]:

• The basic data unit is an observation, e.g. from a satellite, with various attributes such as latitude, 
longitude, temperature, time, etc. 

• The data is multi-dimensional, e.g. the three spatial dimensions, the temporal dimension, and 
various other attributes. 

• The database is fairly stationary, i.e. new data can be appended or results of analyses can be added. 
However, the basic data once added is rarely, if ever, updated. 

• High speed and volume of reprocessing requires support for efficient creation and population of 
relations, both in terms of bandwidth and response time. 

• A very high rate of large size retrieval requests is expected from reprocessing algorithms. Large size 
requests are also expected from the intelligent front-end working on the users’ behalf, albeit not at 
quite the same rate as reprocessing algorithms (though it really depends on user load). 

2.2.2 Approach 

In the following we describe our approach to the specific problems listed below. Comparisons with 
related work are included where appropriate. 

• Data declustering, i.e. partitioning a file of records across multiple disks of a parallel I/O system. 

• Parallel algorithms for range query processing on a single relation/table. 

• Parallel algorithms for loading large data files into relations/tables.

A unit datum is modeled as a tuple/record whose attributes/fields represent various facets of the datum
such as latitude, longitude, temperature, time, etc. Relations/files, i.e. collections of records of the
same type, model sets of observations of the same type. A general request on a collection of observations
of the same type is modeled as a multi-attribute range query, with predicates defined on one or more
attributes.

Let $D_i$ ($1 \le i \le d$) be an ordered set. A record is an ordered $d$-tuple $(r_1, r_2, \ldots, r_d) \in D_1 \times D_2 \times \cdots \times D_d$.
$D_i$ is defined to be the domain of the $i$-th attribute, and $r_i$ is the value of the $i$-th attribute of the record.

A $d$-dimensional file, $F$, is a non-empty set of records, stored on a parallel disk system with $M$ disks.

The most general retrieval operation, the range query, is denoted by $Q = ([L_1, U_1), \ldots, [L_d, U_d))$, with
$[L_i, U_i)$ being the desired range on the $i$-th attribute. The answer to the range query $Q$ is
$A(Q) = \{(r_1, \ldots, r_d) \in F \mid L_i \le r_i < U_i,\ 1 \le i \le d\}$. Note that the exact-match query and the partial-match query can be
treated as special cases of the range query. For a query $Q$, let $Work_i(Q)$ be the number of blocks required
from disk $i$ to answer the query, $1 \le i \le M$, and let $Work(Q) = \sum_{i=1}^{M} Work_i(Q)$ be its total work.

Assuming parallel operation of individual disk units, and that the performance of the I/O subsystem is the
critical factor in system performance (a reasonable assumption given trends in parallel machines),
the response time of the query is $Rsp(Q) = \max_{1 \le i \le M} \{Work_i(Q)\}$. The optimal (minimal) response
time for the query $Q$ when data is distributed over $M$ disks is then $\lceil Work(Q)/M \rceil$.
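
As a small illustration of these definitions, the following Python sketch computes the response time and the optimal lower bound from a list of per-disk block counts. The function names and the sample block counts are assumptions made for illustration only; they do not come from the cited work.

```python
from math import ceil

def response_time(work_per_disk):
    """Rsp(Q): with disks working in parallel, the busiest disk sets the time."""
    return max(work_per_disk)

def optimal_response_time(work_per_disk):
    """Lower bound ceil(Work(Q) / M) for a query spread over M disks."""
    total_work = sum(work_per_disk)  # Work(Q)
    num_disks = len(work_per_disk)   # M
    return ceil(total_work / num_disks)

# Hypothetical per-disk block counts for one query on M = 4 disks.
skewed = [6, 2, 2, 2]
balanced = [3, 3, 3, 3]
print(response_time(skewed), optimal_response_time(skewed))      # 6 3
print(response_time(balanced), optimal_response_time(balanced))  # 3 3
```

Both placements do the same total work of 12 blocks, but only the balanced one reaches the optimal response time.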

Now, the data declustering problem for a parallel record management system is to develop a strategy
such that it provides (i) optimal parallelization of individual queries (speed-up) as well as (ii) good
parallelization of all possible queries (robustness). In the last few years a number of declustering strategies
have been proposed [DeWI90, GHAN91, GHAN92, HUA91, LI92, FALO93, ABDE93]. A survey of
some of these is given in [DeWI92]. The focus of [DeWI90, GHAN91] is to decluster based on a single
attribute, thus improving performance only of queries containing a predicate on that attribute. [GHAN92]
improves upon their previous proposal by selecting a typical query and using information about it to
improve declustering. [HUA91] considers multiple attributes but optimality is not addressed. [ABDE93,
FALO93] identify specific subsets of queries for which their schemes have optimal performance, but the
issue of robustness is not addressed. Our work [LI92] has developed the Coordinate Modulo Declustering
(CMD) technique, which (i) is optimal for a very large percentage of all possible single-relation SQL
queries, (ii) has a small deviation from optimality for the rest, and (iii) has a deviation from optimality
that decreases as the size of the query result grows. Complete details of CMD and its comparison with other
schemes are given in [LI92]. Here we provide a brief overview.

For illustration assume that all files are subsets of the unit space $S = [0, 1)^d$, $d \ge 1$. Divide each
dimension of $S$ into $nM$ equal sized intervals for some integer $n$:

$[0, 1/nM), [1/nM, 2/nM), \ldots, [(nM - 1)/nM, 1)$.

Let the $i$-th interval of the $k$-th dimension be denoted by $I_{k,i} = [l_{k,i}, h_{k,i}) = [i/nM, (i + 1)/nM)$, for
$0 \le i \le nM - 1$, with its interval coordinate, $ic_k$, being $i$. Given a region
$[l_{1,i_1}, h_{1,i_1}) \times [l_{2,i_2}, h_{2,i_2}) \times \cdots \times [l_{d,i_d}, h_{d,i_d})$ of $S$, its region coordinate, $rc$, is defined as the ordered set of its interval coordinates, i.e.
$rc = (ic_1, ic_2, \ldots, ic_d)$. Now, a region, i.e. a partition of the multi-dimensional data space, with region
coordinate $rc$ is assigned to disk $CMD(rc, M)$, where the allocation function CMD is defined as

$CMD(rc, M) = (ic_1 + ic_2 + \cdots + ic_d) \bmod M$.

Example 1: Let $S = [0, 1)^2$, $M = 4$ and $n = 2$, i.e. each dimension is divided into 8 intervals with
length 0.125 each. The partitions of $S$ and their allocation to disk units are shown in Fig. 2.

Figure 2. The partition and allocation of $S = [0, 1) \times [0, 1)$ among 4 disks with $M = 4$ and $n = 2$
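
The allocation in this example can be regenerated with a few lines of Python. This is a minimal sketch that assumes the allocation function is the sum of the interval coordinates modulo the number of disks, as defined above; the function names `cmd` and `allocation_grid` are illustrative and not taken from [LI92].

```python
def cmd(region_coordinate, num_disks):
    """CMD(rc, M): sum of the interval coordinates modulo the number of disks."""
    return sum(region_coordinate) % num_disks

def allocation_grid(num_disks, n):
    """Disk number for every region of the 2-D unit square.

    grid[j][i] is the disk holding the region whose interval coordinate is
    i on the first dimension and j on the second.
    """
    side = n * num_disks  # number of intervals per dimension
    return [[cmd((i, j), num_disks) for i in range(side)] for j in range(side)]

grid = allocation_grid(num_disks=4, n=2)
# Print with the highest second-dimension interval on top, as in a plot.
for row in reversed(grid):
    print(" ".join(str(disk) for disk in row))
```

Each row and each column of the printed grid cycles through disks 0 to 3, which is what gives range queries their balanced disk load.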

We have developed parallel algorithms for multi-dimensional range queries on data with CMD
partitioning. The following theorems describe the key properties of the algorithms. Proofs are given in
[LI92].

Theorem 1 (Speedup): The CMD method is optimal for all range queries whose length, in terms of
the number of regions covered, on some dimension is equal to $kM$ where $k$ is an integer.

Corollary 2.1. The CMD method is optimal for all range queries in which at least one attribute 
is unspecified (since the query length on that attribute is the complete range, automatically an integral 
multiple of M). 

Example 2: Consider query $Q_1 = ([0.000, 0.375), [0.250, 0.750))$ in Figure 2. Assuming each
region can be fetched in a single disk access, $Work(Q_1) = 12$ disk accesses. Since exactly 3 accesses
need to be made to each of the disks 0, 1, 2, 3, the response time for $Q_1$ is optimal. The condition
in Theorem 1 is sufficient but not necessary since optimal response time is also achieved for query
$Q_2 = ([0.625, 1.000), [0.250, 0.500))$.
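
The two queries of this example can be checked mechanically. The sketch below assumes the 8 x 8 region grid of Example 1 (M = 4, n = 2), query ranges aligned to region boundaries, and one disk access per region; the helper names are illustrative.

```python
from collections import Counter
from math import ceil

def cmd(region_coordinate, num_disks):
    """CMD(rc, M): sum of the interval coordinates modulo the number of disks."""
    return sum(region_coordinate) % num_disks

def work_per_disk(query, num_disks, n):
    """Count the regions a 2-D range query touches on each disk.

    query is ((x_lo, x_hi), (y_lo, y_hi)) with half-open ranges that are
    assumed to fall exactly on region boundaries.
    """
    side = n * num_disks
    (x_lo, x_hi), (y_lo, y_hi) = query
    xs = range(round(x_lo * side), round(x_hi * side))
    ys = range(round(y_lo * side), round(y_hi * side))
    counts = Counter(cmd((i, j), num_disks) for i in xs for j in ys)
    return [counts.get(disk, 0) for disk in range(num_disks)]

q1 = ((0.000, 0.375), (0.250, 0.750))
q2 = ((0.625, 1.000), (0.250, 0.500))
for query in (q1, q2):
    work = work_per_disk(query, num_disks=4, n=2)
    print(work, max(work), ceil(sum(work) / 4))
# [3, 3, 3, 3] 3 3
# [2, 2, 1, 1] 2 2
```

In both cases the busiest disk performs exactly the optimal number of accesses, even though the second query covers a multiple of M regions on neither dimension.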

Theorem 2 (Robustness): For any arbitrary range query $Q$ the response time, $Rsp(Q)$, is bounded
by $\lceil Work(Q)/M \rceil + (M - 1)^{d-1} - 1$.

Theorem 2 gives an approximate upper bound, and the actual performance of CMD is much better.
For example, for 2 and 3 dimensions the worst case upper bounds are $M/4$ and $M^2/16$, respectively.
Note that range queries usually examine a very large subspace of $S$, i.e. $Work(Q)$ is usually large. Thus
$\lceil Work(Q)/M \rceil$, the fraction that is optimal, is much more significant than $(M - 1)^{d-1} - 1$.
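
For small configurations the deviation from optimality can be measured by brute force. The sketch below is an illustration only, not the analysis of [LI92]: it enumerates every range query made of whole regions, assumes one disk access per region, and reports the largest gap between the CMD response time and the optimum.

```python
from itertools import product
from math import ceil

def cmd(region_coordinate, num_disks):
    """CMD(rc, M): sum of the interval coordinates modulo the number of disks."""
    return sum(region_coordinate) % num_disks

def max_deviation(num_disks, n, dims):
    """Largest Rsp(Q) - ceil(Work(Q)/M) over all region-aligned range queries."""
    side = n * num_disks
    # Every half-open run of intervals [lo, hi) on one dimension.
    runs = [(lo, hi) for lo in range(side) for hi in range(lo + 1, side + 1)]
    worst = 0
    for box in product(runs, repeat=dims):
        counts = [0] * num_disks
        for rc in product(*(range(lo, hi) for lo, hi in box)):
            counts[cmd(rc, num_disks)] += 1
        deviation = max(counts) - ceil(sum(counts) / num_disks)
        worst = max(worst, deviation)
    return worst

M, d = 4, 2
print(max_deviation(num_disks=M, n=2, dims=d))  # 1
print((M - 1) ** (d - 1) - 1)                   # 2, the additive term of Theorem 2
```

For this configuration the worst case is a single extra access (for instance a query covering a 2 x 2 block of regions), which is below the additive term of the theorem.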

Parallel Data Loading Algorithms: Our recent work [LI93] is developing efficient parallel algo- 
rithms for loading files of records into a CMD format. Initial results show that almost linear speedup of 
the process, in terms of the number of disk units, is achievable. Detailed algorithms and their properties 
are discussed in [LI93]. 

2.3 Parallel Relational Algebra Layer 

The parallel relational algebra layer contains algorithms for complex operations such as join, union, 
difference, and aggregation. It uses the abstractions provided by the parallel record management layer. 

2.3.1 Requirements 

Descriptions of various NASA projects, including the Intelligent Information Fusion (IIF) system 
[ROEL91, CAMP90a], the Intelligent User Interface for Catalog Browsing system [CROM89], etc., have 
identified the need for performing complex comparisons across different types of data sets. Thus, the 
requirements for this layer are: 

• Efficient algorithms for complex operations such as join, union, set difference, etc. 

• Efficient algorithms for various kinds of aggregate operations. 

• Since space and time are special types of attributes, correlations on them can be potentially treated 
in a more specialized and efficient manner, e.g. by supporting temporal joins [McKE92]. 

2.3.2 Approach 

In the previous section we presented results about the efficacy of the CMD approach in processing
queries accessing a single relation. A vast body of work [DeWI92, WOLF91, FRIE90, SCHN89,
CHEN92, NICC92] has shown that join continues to be one of the most expensive relational operations in
the parallel environment. Our recent work [NICC92] has shown that one approach to achieving efficiency
for complex database operations in a parallel environment is to make them declustering aware, i.e. an
algorithm implementing a complex operation (e.g. join) will perform better if it is aware of the underlying
declustering strategy. [NICC92] describes and analyzes in detail the benefits of making the hybrid-hash join
algorithm [DeWI84] aware of CMD declustering. We outline the approach here.

For a relation stored using CMD declustering, we define the following: 

Definition (Join Axis, $b$): The axis of the multi-dimensional space representing the join attribute $b$.

Each interval $(l_i, h_i)$ along the join axis denotes a subrange of the join attribute domain.

Definition (Joining Region, $JR(R, b, i)$): The $d - 1$ dimensional subspace, of the $d$ dimensional space,
created by fixing the subrange of the join axis, $b$, to have values in the interval $(l_i, h_i)$ and allowing the
other axes to be free.

$JR(R, a, i)$ is the $i$-th joining region of relation $R$ along attribute axis $a$.

As shown in Figure 3(a), consider $R$ and $S$ as relations to be joined on attribute $b$. $JR(R, b, 2)$ and
$JR(S, b, 1)$ are example joining regions of relations $R$ and $S$, respectively. A joining region of $R$ must join
with every joining region of $S$ with which it overlaps on the join axis. Thus, $JR(R, b, i)$ and $JR(S, b, j)$
must be joined iff:

$(l_i \le l_j < h_i)$ or $(l_i < h_j \le h_i)$
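
The pairing of joining regions can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [NICC92]: intervals are treated as half-open, and the general interval-overlap test is used, which may be slightly broader than the condition as printed. The interval values are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a = (low, high) and b = (low, high) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def region_pairs(r_intervals, s_intervals):
    """Index pairs (i, j) of joining regions of R and S that must be joined."""
    return [
        (i, j)
        for i, r_interval in enumerate(r_intervals)
        for j, s_interval in enumerate(s_intervals)
        if overlaps(r_interval, s_interval)
    ]

# Hypothetical join-axis subranges of the joining regions of R and S.
r_regions = [(0, 10), (10, 20), (20, 30)]
s_regions = [(5, 15), (40, 50)]
print(region_pairs(r_regions, s_regions))  # [(0, 0), (1, 0)]
```

Here the third region of R and the second region of S appear in no pair, so neither needs to be read for the join.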

The following results describe the properties of our declustering aware approach, details of which are 
presented in [NICC92]: 

Theorem 3: If there is enough aggregate buffer memory, i.e. among all processors together, to hold 
the largest joining region of the smaller relation, plus one disk block per processor, then no data need be 
read from the I/O system more than once. 

Corollary: There exist cases where a declustering-aware algorithm will read a disk block exactly once
while a non-declustering-aware algorithm will read it more than once.

In addition to reducing disk accesses, a declustering-aware algorithm may entirely eliminate part of
the computation, by skipping over entire joining regions of either relation, if there is no intersecting
joining region of the other relation, as shown in Figure 3(b). Essentially, a declustering-aware approach
to query processing has the following advantages:

• A large problem is broken down into a set of subproblems, such that the sum of the work for
the set of subproblems is usually less than that for the original. For example, the work for an
equi-join between relations $R$ and $S$, with sizes $|R|$ and $|S|$ respectively, is roughly proportional to
$|R||S|$, say with a nested-loops join. If, however, the join axis has $k$ partitions, a declustering-aware
nested-loops algorithm is required to do only $k(|R||S|)/k^2$ total work for the $k$ subproblems.

• The performance of most database algorithms, e.g. join, sort, etc., is highly sensitive to the amount
of main memory buffer available, with performance often increasing dramatically as the ratio
BufferSize/ProblemSize increases [CHOU86, YU93]. For a given amount of aggregate main
memory buffer (of the parallel machine), breaking a problem into smaller subproblems has the net
effect of increasing this ratio.

• Skewed data distribution causes serious performance problems for most database algorithms [DeWI92a,
DeWI92b], mainly due to improper load balancing. Declustering-aware algorithms provide one way
to handle this [NICC92].
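
The first advantage above can be made concrete by counting comparisons. The sketch below is a simplified, single-process illustration with assumed data: both relations are reduced to lists of join-attribute values, and both are partitioned on the join axis into the same k equal-width ranges, so only partitions with the same index need to be joined.

```python
def nested_loops_join(r_values, s_values):
    """Plain nested-loops equi-join; returns the matches and the comparison count."""
    matches, comparisons = [], 0
    for a in r_values:
        for b in s_values:
            comparisons += 1
            if a == b:
                matches.append((a, b))
    return matches, comparisons

def partition(values, k, domain_size):
    """Split values in [0, domain_size) into k equal-width ranges on the join axis."""
    width = domain_size // k
    parts = [[] for _ in range(k)]
    for value in values:
        parts[min(value // width, k - 1)].append(value)
    return parts

def partitioned_join(r_values, s_values, k, domain_size):
    """Join matching partitions only and add up the work of the k subproblems."""
    matches, comparisons = [], 0
    r_parts = partition(r_values, k, domain_size)
    s_parts = partition(s_values, k, domain_size)
    for r_part, s_part in zip(r_parts, s_parts):
        part_matches, part_comparisons = nested_loops_join(r_part, s_part)
        matches.extend(part_matches)
        comparisons += part_comparisons
    return matches, comparisons

r = list(range(100))        # |R| = 100
s = list(range(0, 100, 2))  # |S| = 50
plain_matches, plain_work = nested_loops_join(r, s)
part_matches, part_work = partitioned_join(r, s, k=4, domain_size=100)
print(plain_work, part_work)  # 5000 1250
```

With k = 4 evenly filled partitions the comparison count drops by a factor of 4 while the join result is unchanged; with skewed data the saving would be smaller.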

2.4 Parallel Query Compilation and Scheduling Layer 

Database query compilation for sequential machines provides the functionality of translating a high- 
level (declarative) query into an optimized sequence of relational algebra and record management level 
operations. For a parallel machine, the additional decisions of (i) determining the type and degree of 
parallelization, (ii) an estimation of resource requirements, and (iii) an initial assignment of resources, 
must be made [GANG92, SRIV93].

2.4.1 Requirements 

Descriptions of various NASA projects, including the Intelligent Information Fusion (IIF) system
[ROEL91, CAMP90a], the Intelligent User Interface for Catalog Browsing system [CROM89], etc., have
identified that the interface between the applications, e.g. the intelligent front-end of Figure 1, and the
database of data products should be a high-level one, e.g. SQL. Query compilation and scheduling for
parallel databases is currently an active research area [DeWI92a, WILS91, GANG92, SCHN90, HUA93,
SRIV93, NICC93]. While a detailed survey and comparisons are provided in [SRIV93, NICC93], the basic
requirements for this layer are:

• Translation from SQL to an internal form (not a research issue). 

• Optimizations performed on the internal form based on the desired objective, e.g. minimize work,
minimize response time, etc., to generate a 'good' query execution plan.

• Determining the type(s) and degree of parallelization of the query plan.

• Estimation of resource needs for a query plan to help resource managers during query execution. 

• Determining an initial resource allocation for the plan, which may potentially be modified during 
execution. 

2.4.2 Approach 

Our overall approach to query compilation is shown in Figure 4. It is a 2-phase approach, where in
Phase 1 a compiler that optimizes SQL for sequential machines is used, which (heuristically) minimizes
work. This is not a research issue since good sequential optimizers exist. The output is fed to Phase 2
which (i) parallelizes the sequential plan, (ii) estimates its resource needs, and (iii) generates an initial
resource allocation schedule. The output of Phase 2 is a set of tasks schedulable on a parallel machine.
An example input query, represented as a query graph, and its corresponding set of tasks
are shown in Figure 5. In each of the seven time slices, numbered 0 through 6, the total resources
allocated for this query's execution are shared between the tasks allocated to the slice. Further details
are in [NICC93]. While in general it is not true that the parallelization of a 'good' (or even the optimal)
sequential plan will yield the best parallel plan, a 2-phase approach such as ours has the advantages of
(i) drastically reducing the search space size, and (ii) leveraging off the existing technology in sequential
optimization. We share the belief with [STON88, HONG91, HONG92] that a 2-phase approach is a
viable heuristic and worth a detailed investigation.



We now briefly describe the key elements of our approach to query compilation and scheduling. 
Details are provided mainly in [NICC93] and some in [SRIV93]. Specifically, we propose (i) a parallel
query plan representation, (ii) a new cost model to incorporate parallel execution, and (iii) heuristic 
search algorithms. 

Query Plan Representation: A parallel query plan can exploit the following kinds of parallelism: 

• Intra-operator parallelism: A relational operator, such as select, project or join, can be performed
by multiple processors simultaneously.

• Inter-operator parallelism: Different relational operators of a query, e.g. different joins, can be
performed in parallel by different (sets of) processors.

• Pipelining: Different relational operators can be performed in a pipelined manner using separate
(groups of) processors. The result of one is pipelined to the other.

In our model a parallel query plan is represented as a capacitated labeled ordered binary tree. The
shape represents inter-operator parallelism, the orientation represents operand ordering, the node labeling
represents intra-operator parallelism, the M (P) branch labeling represents materialization (pipelining)
of results between operators, and the branch capacity represents the size of the main-memory
(producer-consumer) buffer when materialization (pipelining) of intermediate results is being done.


Figure 6 shows a plan for a query with four joins, i.e. $J_1$, $J_2$, $J_3$ and $J_4$, between five relations,
i.e. $R_1$, $R_2$, $R_3$, $R_4$ and $R_5$. $J_1$ has inter-operator parallelism with $J_2$ (and with $J_3$). Operations $J_1$
and $J_4$ are on a root-leaf path and thus do not have inter-operator parallelism. The same holds between
$J_2$, $J_3$ and $J_4$. Since the branch between $J_1$ and $J_4$ is labeled with M, $J_1$ must complete before $J_4$ can
begin. The same holds for $J_2$ and $J_3$. The branch between $J_3$ and $J_4$ is labeled P, and thus the two are
pipelined, with $J_4$ beginning as soon as $J_3$ has produced the first result tuple. The labels 4, 4, 6 and 6
on $J_1$, $J_4$, $J_2$ and $J_3$, respectively, represent the number of processors assigned to each. Note that the
processors assigned to operators at the opposite ends of a branch labeled M are the same set, i.e. they
first perform the child task and then proceed to the parent task. The processors on the opposite ends of
a branch labeled P are distinct sets since the operations are pipelined. The 4 processors will first perform
the join $J_1$ and then $J_4$. The 6 processors will first perform the join $J_2$ and then $J_3$. The two processor
sets will be working independently while performing the joins $J_1$ and $J_2$. While performing $J_3$ and $J_4$,
the 6 processor set will be the producer while the 4 processor set will be the consumer. The capacity of
2 on the branch ($J_4$, $J_3$) means that the intermediate buffer is assigned 2 units of memory. A capacity of
4 on the other branches indicates that each materialized intermediate result has been assigned 4 units of
buffer space. Upon overflow the results must go to disk. Total system memory is 10 units.
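
The plan just described can be written down as a small data structure. This is a hedged sketch, not the representation used in [NICC93]: the class names are hypothetical, base relations are kept as plain names rather than tree leaves, and the assignment of $R_1$ to $R_5$ to particular joins is an assumption, since the text does not state it.

```python
from dataclasses import dataclass, field

@dataclass
class Branch:
    label: str     # "M" = materialized, "P" = pipelined
    capacity: int  # buffer units assigned to the intermediate result
    child: "Node"

@dataclass
class Node:
    name: str
    processors: int                               # degree of intra-operator parallelism
    inputs: list = field(default_factory=list)    # base relation names (assumed)
    branches: list = field(default_factory=list)  # operator inputs

def total_buffer(node):
    """Sum of the buffer capacities on all branches of the plan."""
    return sum(b.capacity + total_buffer(b.child) for b in node.branches)

def processor_groups(root):
    """Group operators that share one processor set (linked by M branches)."""
    groups = []

    def visit(node, group):
        group.append(node.name)
        for branch in node.branches:
            if branch.label == "M":
                visit(branch.child, group)  # same processors, run child first
            else:
                new_group = []              # pipelined producer: separate set
                groups.append(new_group)
                visit(branch.child, new_group)

    first = []
    groups.append(first)
    visit(root, first)
    return groups

j1 = Node("J1", 4, inputs=["R1", "R2"])
j2 = Node("J2", 6, inputs=["R3", "R4"])
j3 = Node("J3", 6, inputs=["R5"], branches=[Branch("M", 4, j2)])
j4 = Node("J4", 4, branches=[Branch("M", 4, j1), Branch("P", 2, j3)])

print(total_buffer(j4))      # 10
print(processor_groups(j4))  # [['J4', 'J1'], ['J3', 'J2']]
```

The buffer capacities add up to the 10 units of system memory, and the two groups correspond to the 4-processor and 6-processor sets of the description.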

Cost Model for Parallel Query Plans: A cost model for parallel query plans requires (i) developing 
analytical cost expressions for individual operators such as select, project, join, etc., and (ii) combining 
the expressions for individual operators to obtain costs for entire plans. Special care has to be paid in
combining costs for operators executing in a parallel or pipelined manner. The two key components are: 

• Cost of Individual Operators: A number of simulation and experimental evaluations of parallel
algorithms for relational operators exist [DeWI90, BARU88, SCHN89, FRIE90]. For query
optimization, however, an analytical parameterized cost model is needed. In addition to conventional
parameters such as database size, query selectivity, indexes, algorithm used, etc., the cost of an
operator depends on (i) its degree of parallelization, (ii) its resource allocation, (iii) parameters of
the machine architecture, e.g. costs for unit processing, I/O, and communication operations, and
(iv) data declustering.

• Combining Operator Costs: For parallel query processing the plan with total minimum work and the
one with the shortest critical path may not be identical [GUST89]. Maximizing overall throughput
in a multiprogrammed environment requires minimizing a query’s total work, while minimizing
individual response time requires reducing the critical path. Calculating the critical path in a plan
can be quite tricky as it needs to consider data flow dependencies and resource allocation [GANG92,
SRIV93, NICC93].

In [SRIV93, NICC93] we describe the details of a cost model that addresses the above issues. It 
provides means of labeling nodes of the query plan tree with various cost metrics such as work, response 
time, etc., and lends itself to efficient bottom-up evaluation. 
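
To illustrate how total work and critical path can be computed together in one bottom-up pass, the following is a minimal sketch and not the cost model of [SRIV93, NICC93]. The class PlanNode, the function label, and the rule that a pipelined operator fully overlaps with its inputs are simplifying assumptions made for this example; resource allocation and detailed data flow dependencies are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PlanNode:
    """One operator in a query plan tree (hypothetical structure)."""
    name: str
    own_cost: float                      # cost of this operator alone, assumed given
    children: List["PlanNode"] = field(default_factory=list)
    pipelined: bool = False              # True: consumes inputs while they are produced


def label(node: PlanNode) -> Tuple[float, float]:
    """Return (total work, response time) for the subtree rooted at node."""
    if not node.children:
        return node.own_cost, node.own_cost
    child_labels = [label(child) for child in node.children]
    # Work is additive: every operator's cost is paid somewhere in the system.
    work = node.own_cost + sum(w for w, _ in child_labels)
    # Sibling subtrees are assumed to run in parallel with each other.
    slowest_input = max(t for _, t in child_labels)
    if node.pipelined:
        # Assumption: a pipelined operator overlaps completely with its inputs.
        response = max(node.own_cost, slowest_input)
    else:
        # A materialized input must be complete before the operator starts.
        response = node.own_cost + slowest_input
    return work, response


# Example plan: J2 pipelines the output of J1 and a scan of C.
scan_a = PlanNode("scan A", 4.0)
scan_b = PlanNode("scan B", 6.0)
scan_c = PlanNode("scan C", 3.0)
j1 = PlanNode("J1", 5.0, [scan_a, scan_b])
j2 = PlanNode("J2", 8.0, [j1, scan_c], pipelined=True)
work, response = label(j2)
```

In this example the total work of the plan is larger than its response time, which is why an optimizer that minimizes one metric need not minimize the other.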

Search Algorithm: It has been argued [SWAM88, SWAM89, IOAN90] that exhaustive enumeration 
techniques such as dynamic programming [SELI79] are unlikely to be successful for queries with 
a large number of joins, i.e. 100 or so; these authors have proposed heuristic combinatorial 
optimization techniques such as Simulated Annealing, Iterative Improvement, and Successive 
Augmentation. The size of the search space for parallel query plans will be much larger than that for 
sequential ones [SRIV93], which makes efficient search algorithms of paramount importance. In 
[SRIV93] and [NICC93] we present two search heuristics to reduce the search space. The key elements 
of our approach are the following: 

• The join-tree output from the sequential optimizer is converted into an operator tree. 

• Decisions are made about which branches, i.e. intermediate results, will be pipelined and which will 
be materialized. 

• Resource estimation for various tasks is done. 

• Resource allocation for various tasks is carried out. 

• At each step some heuristic choices are made to reduce the search space size. 
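
As an illustration of the heuristic combinatorial techniques cited above, the following is a minimal sketch of Iterative Improvement applied to the ordering of joins in a left-deep plan. It is not the search heuristic of [SRIV93] or [NICC93]. The function names, the toy cost function (sum of intermediate result sizes), and the swap-two-relations move are assumptions chosen for the example.

```python
import random
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple


def left_deep_cost(order: Sequence[str],
                   card: Dict[str, float],
                   sel: Dict[FrozenSet[str], float]) -> float:
    """Toy cost: sum of the intermediate result sizes of a left-deep join order."""
    size = card[order[0]]
    joined = [order[0]]
    total = 0.0
    for rel in order[1:]:
        factor = 1.0
        for prev in joined:
            # Missing predicate means a cross product (selectivity 1.0).
            factor *= sel.get(frozenset((prev, rel)), 1.0)
        size = size * card[rel] * factor
        total += size
        joined.append(rel)
    return total


def iterative_improvement(relations: Sequence[str],
                          cost: Callable[[Sequence[str]], float],
                          restarts: int = 10,
                          max_fail: int = 50,
                          seed: int = 0) -> Tuple[List[str], float]:
    """Repeated local descent; the first start is the given order, the rest are random."""
    rng = random.Random(seed)
    best: List[str] = list(relations)
    best_cost = float("inf")
    for attempt in range(restarts):
        cur = list(relations)
        if attempt > 0:
            rng.shuffle(cur)
        cur_cost = cost(cur)
        failures = 0
        while failures < max_fail:
            i, j = rng.sample(range(len(cur)), 2)
            cur[i], cur[j] = cur[j], cur[i]          # try a neighbouring order
            new_cost = cost(cur)
            if new_cost < cur_cost:
                cur_cost, failures = new_cost, 0     # accept only downhill moves
            else:
                cur[i], cur[j] = cur[j], cur[i]      # undo the swap
                failures += 1
        if cur_cost < best_cost:
            best, best_cost = list(cur), cur_cost
    return best, best_cost


# Hypothetical statistics for four relations.
card = {"A": 1000.0, "B": 100.0, "C": 10.0, "D": 5000.0}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1,
       frozenset(("C", "D")): 0.001}
start = ["A", "D", "B", "C"]
order, order_cost = iterative_improvement(
    start, lambda o: left_deep_cost(o, card, sel))
```

Because the first descent starts from the given order and only downhill moves are accepted, the returned plan is never worse than the starting plan under the supplied cost function. It is a local minimum and is not guaranteed to be the global optimum.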

We have built a prototype query optimizer and performed its initial evaluation [SRIV93, NICC93]. 
Figure 7 shows a schematic of our prototype optimizer. It is a customizable optimizer in the sense that 
it is table-driven and takes architectural parameters from a file as an input to its cost model. Thus, it is 
customizable to various architectures. 
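
To make the idea of a table-driven cost model concrete, the following is a minimal sketch in which an analytical operator cost is parameterized by the degree of parallelism and by machine parameters read from a configuration. The JSON format, the parameter names, and the formula for a parallel scan over evenly declustered data are all assumptions made for this example; they are not the file format or the cost expressions of the prototype.

```python
import json
import math
from typing import Dict

# Hypothetical architecture description; in practice this would be read from a file.
ARCH_JSON = """
{
  "cpu_per_tuple": 0.001,
  "io_per_page": 0.02,
  "tuples_per_page": 100
}
"""


def scan_cost(n_tuples: int, degree: int, arch: Dict[str, float]) -> float:
    """Estimated response time of a parallel scan.

    Assumes the relation is declustered evenly over `degree` nodes, each with
    its own disk, so every node reads and processes an equal share.
    """
    pages = math.ceil(n_tuples / arch["tuples_per_page"])
    pages_per_node = math.ceil(pages / degree)
    tuples_per_node = math.ceil(n_tuples / degree)
    io_time = pages_per_node * arch["io_per_page"]
    cpu_time = tuples_per_node * arch["cpu_per_tuple"]
    return io_time + cpu_time


arch = json.loads(ARCH_JSON)
sequential = scan_cost(100_000, 1, arch)
parallel = scan_cost(100_000, 10, arch)
```

Retargeting such a model to another machine then amounts to supplying a different parameter table, with the cost expressions themselves left unchanged.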


3. Goals & Specific Research Issues 



3.1 Research Issues in the Record Management Layer 

For the record management layer, the following specific research problems must be addressed: 

• Evaluate the CMD approach with NASA data sets. 

• Based on the above evaluation, tune/modify CMD and, if need be, create new declustering strategies 
for NASA’s data sets. 

• Enhance our approach to provide better declustering by including information about a core set of 
NASA application queries. Many applications often have such a set, and we would like to identify 
such a set for the reprocessing algorithms. 

• Since the relations are partially sorted on each dimension, the benefit of this partial ordering for 
parallel external sorting algorithms needs to be examined. 

• CMD provides implicit indexing because of the partial ordering of the various domains. How this 
affects, and is complemented by, explicit indices, e.g. tree or hash based, needs exploration. 

• Develop specialized indices for the parallel I/O system to speed up the evaluation of 
aggregates [SRIV89], temporal selections [KOLO89], etc. 

• Develop efficient parallel algorithms for loading large data files into relations in the PRDBMS, since 
this is expected to be a frequent operation [PRAT93]. 

• Develop algorithms to perform operations along the temporal and spatial dimensions efficiently. 

3.2 Research Issues in the Parallel Relational Algebra Layer 

In this layer the following research issues must be addressed: 

• Evaluate our declustering aware join algorithm on NASA’s data sets. 

• Based on the above evaluation, tune/modify the join algorithm and, if need be, create new ones for 
NASA’s data sets. 

• Apply the declustering aware approach to other algorithms in the relational algebra layer, e.g. 
union, difference, aggregation, etc. 

3.3 Research Issues in the Query Compilation and Scheduling Layer 

Query compilation and scheduling is a wide open area of research today. Given that it took almost 
a decade to obtain satisfactory sequential database query compilers, this is likely to remain an area 
of active research for a few years. Specifically, the following research issues must be addressed: 

• Evaluate the effectiveness of our optimizer on some typical queries found in NASA applications. 

• Customize our prototype optimizer for a parallel architecture on which NASA may consider 
building or acquiring a parallel DBMS. 

• Evaluate and validate the optimizer cost model, which is one of the keys to building a successful 
optimizer [DeWI92a]. 

4. Conclusions 

In the past decade there has been a tremendous growth in the amount of data and resultant information 
generated by NASA’s operations and research projects. This growth is expected to continue in the future. 
Use of parallel computers, both processing and input-output, will be a key to solving the resultant data 
management problem. In this paper we have described the architecture of a parallel data management 
system which is based on visualizing data as points in space and query processing as geometric operations. 
The architecture is highly parallel and is quite generic, i.e. can be realized on a wide variety of parallel 
machines. We provided an overview of our results and pointed out a number of open research issues. 